Intro to Data Science Final Project

Author

Cova, Langbehn, Villa & Barros

Data Visualization and Initial Model Set Up


Setup and Data Loading: Packages

Code
library(tidymodels)   
library(tidyverse)   
library(janitor)   
library(naniar)       
library(assertr)      
library(corrplot)
library(gridExtra)

## Turn off scientific notation for readable numbers
options(scipen = 999)

Load the Data

Filler out Missing Values in our key predicted outcome as this will cause our predicted models to fail, it provides no addtiaional data to our model. However for missingness in our predictors, imputation will help ensure non-systemic missingness adversely effects the sample size of our model.

Code
## Read in the CSV file
remittances <- read_csv("../data/remittances.csv") |>
  filter(!is.na(remittances_gdp))

Create Training and Testing Splits

Code
## Set seed for reproducibility
set.seed(20251211)

## Split data: 80% training, 20% testing
remit_split <- initial_split(data = remittances, prop = 0.8)
remit_train <- training(x = remit_split) 
remit_test <- testing(x = remit_split)

STEP 1: Data Overview

Get a Quick Look at the Data Structure

Code
## View the first few rows and column types
glimpse(remit_train)
Rows: 3,292
Columns: 19
$ `Country Name`     <chr> "Germany", "Brazil", "Namibia", "Mexico", "Tanzania…
$ year               <dbl> 2004, 2012, 2009, 2000, 2023, 2003, 2019, 2022, 201…
$ remittances        <dbl> 6515541641, 2784072055, 76511275, 7524742980, 75881…
$ remittances_gdp    <dbl> 0.22842973, 0.11293366, 0.85594119, 1.01403249, 0.9…
$ `Country Code`     <chr> "DEU", "BRA", "NAM", "MEX", "TZA", "NAM", "CHE", "L…
$ gdp                <dbl> 2852317768062, 2465227802807, 8938847189, 742061329…
$ stock              <dbl> 1093030.29, 323363.20, NA, 8072288.08, NA, NA, 2336…
$ unemployment       <dbl> 10.727, 7.251, 22.254, 2.646, 2.582, 22.052, 4.394,…
$ gdp_per            <dbl> 34566.7359, 12521.7213, 4302.9137, 7524.0271, 1224.…
$ inflation          <dbl> 1.1550593, 7.9431269, 6.9454043, 10.6979646, 2.7482…
$ vulnerable_emp     <dbl> 7.044898, 25.346971, 28.026294, 31.785613, 83.87807…
$ maternal_mortality <dbl> 6, 60, 547, 56, 276, 290, 6, 477, 68, 101, 41, 842,…
$ exchange_rate      <dbl> 0.8039216, 1.9530686, 8.5228198, 9.4555583, 2383.04…
$ deportations       <dbl> 91, 639, 1, 44564, 19, 1, 4, 1, 4, NA, 53, 63, 533,…
$ internet           <dbl> 64.73000, 48.56000, 6.50000, 5.08138, 29.06380, 3.3…
$ poverty            <dbl> 0.0, 6.4, 38.6, 16.3, NA, 47.6, 0.0, NA, NA, 38.2, …
$ dist_pop           <dbl> 6035.334, 7694.307, 11720.190, 3369.053, NA, 11720.…
$ dist_cap           <dbl> 6717.542, 6794.436, 11908.000, 3037.916, NA, 11908.…
$ terror             <dbl> 2, 4, 2, 3, NA, 3, 1, 3, 2, 2, 2, 3, 2, 2, 2, 2, 4,…

The training data has 3,918 rows and 19 columns. We can see:

  • Character columns: country_name, country_code
  • Numeric columns: year, remittances, gdp, unemployment, etc.

Calculate Summary Statistics

Code
## Get min, max, mean, median for all variables
summary(remit_train)
 Country Name            year       remittances           remittances_gdp     
 Length:3292        Min.   :1994   Min.   :        6038   Min.   :  0.000029  
 Class :character   1st Qu.:2002   1st Qu.:    74764744   1st Qu.:  0.366894  
 Mode  :character   Median :2010   Median :   545314436   Median :  1.608458  
                    Mean   :2010   Mean   :  2655647852   Mean   :  4.070970  
                    3rd Qu.:2017   3rd Qu.:  2042426540   3rd Qu.:  4.620723  
                    Max.   :2024   Max.   :137674533896   Max.   :108.402724  
                                                                              
 Country Code            gdp                     stock           
 Length:3292        Min.   :      37184925   Min.   :     259.6  
 Class :character   1st Qu.:    6495301634   1st Qu.:   39746.2  
 Mode  :character   Median :   25703593810   Median :   97162.5  
                    Mean   :  303934455901   Mean   :  388798.9  
                    3rd Qu.:  168683410868   3rd Qu.:  260319.4  
                    Max.   :18316765021700   Max.   :23126089.8  
                                             NA's   :1460        
  unemployment       gdp_per           inflation        vulnerable_emp   
 Min.   : 0.100   Min.   :   109.6   Min.   : -32.741   Min.   : 0.1257  
 1st Qu.: 3.829   1st Qu.:  1356.2   1st Qu.:   1.834   1st Qu.:13.8649  
 Median : 6.232   Median :  4324.5   Median :   4.197   Median :33.2140  
 Mean   : 7.820   Mean   : 12367.7   Mean   :  12.021   Mean   :38.8326  
 3rd Qu.:10.491   3rd Qu.: 14711.9   3rd Qu.:   8.589   3rd Qu.:61.1256  
 Max.   :34.007   Max.   :138935.0   Max.   :4800.532   Max.   :94.7169  
 NA's   :167                         NA's   :32         NA's   :267      
 maternal_mortality exchange_rate         deportations        internet     
 Min.   :   1.0     Min.   :    0.0004   Min.   :    0.0   Min.   :  0.00  
 1st Qu.:  14.0     1st Qu.:    1.7617   1st Qu.:    7.0   1st Qu.:  3.53  
 Median :  61.0     Median :    8.2770   Median :   34.0   Median : 23.00  
 Mean   : 172.7     Mean   :  390.1688   Mean   :  989.7   Mean   : 33.84  
 3rd Qu.: 225.5     3rd Qu.:  116.3786   3rd Qu.:  134.0   3rd Qu.: 62.60  
 Max.   :5721.0     Max.   :15236.8847   Max.   :90504.0   Max.   :100.00  
 NA's   :153        NA's   :34           NA's   :360       NA's   :223     
    poverty         dist_pop          dist_cap         terror     
 Min.   : 0.00   Min.   :  548.4   Min.   :  737   Min.   :1.000  
 1st Qu.: 0.25   1st Qu.: 6035.3   1st Qu.: 6274   1st Qu.:1.000  
 Median : 1.50   Median : 7873.0   Median : 8081   Median :2.000  
 Mean   :10.33   Mean   : 8471.3   Mean   : 8618   Mean   :2.345  
 3rd Qu.:11.55   3rd Qu.:11381.9   3rd Qu.:11674   3rd Qu.:3.000  
 Max.   :89.40   Max.   :16180.3   Max.   :16371   Max.   :5.000  
 NA's   :1949    NA's   :145       NA's   :145     NA's   :148    

Key observations:

  • Years range from 1994 to 2024 (30 years of data)
  • Remittances range from $11,470 to $137.7 billion per county.

Check Internal Structure

Code
## Detailed structure of the data
str(remit_train)
spc_tbl_ [3,292 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Country Name      : chr [1:3292] "Germany" "Brazil" "Namibia" "Mexico" ...
 $ year              : num [1:3292] 2004 2012 2009 2000 2023 ...
 $ remittances       : num [1:3292] 6515541641 2784072055 76511275 7524742980 758814528 ...
 $ remittances_gdp   : num [1:3292] 0.228 0.113 0.856 1.014 0.96 ...
 $ Country Code      : chr [1:3292] "DEU" "BRA" "NAM" "MEX" ...
 $ gdp               : num [1:3292] 2852317768062 2465227802807 8938847189 742061329749 79062403837 ...
 $ stock             : num [1:3292] 1093030 323363 NA 8072288 NA ...
 $ unemployment      : num [1:3292] 10.73 7.25 22.25 2.65 2.58 ...
 $ gdp_per           : num [1:3292] 34567 12522 4303 7524 1224 ...
 $ inflation         : num [1:3292] 1.16 7.94 6.95 10.7 2.75 ...
 $ vulnerable_emp    : num [1:3292] 7.04 25.35 28.03 31.79 83.88 ...
 $ maternal_mortality: num [1:3292] 6 60 547 56 276 290 6 477 68 101 ...
 $ exchange_rate     : num [1:3292] 0.804 1.953 8.523 9.456 2383.043 ...
 $ deportations      : num [1:3292] 91 639 1 44564 19 ...
 $ internet          : num [1:3292] 64.73 48.56 6.5 5.08 29.06 ...
 $ poverty           : num [1:3292] 0 6.4 38.6 16.3 NA 47.6 0 NA NA 38.2 ...
 $ dist_pop          : num [1:3292] 6035 7694 11720 3369 NA ...
 $ dist_cap          : num [1:3292] 6718 6794 11908 3038 NA ...
 $ terror            : num [1:3292] 2 4 2 3 NA 3 1 3 2 2 ...
 - attr(*, "spec")=
  .. cols(
  ..   `Country Name` = col_character(),
  ..   year = col_double(),
  ..   remittances = col_double(),
  ..   remittances_gdp = col_double(),
  ..   `Country Code` = col_character(),
  ..   gdp = col_double(),
  ..   stock = col_double(),
  ..   unemployment = col_double(),
  ..   gdp_per = col_double(),
  ..   inflation = col_double(),
  ..   vulnerable_emp = col_double(),
  ..   maternal_mortality = col_double(),
  ..   exchange_rate = col_double(),
  ..   deportations = col_double(),
  ..   internet = col_double(),
  ..   poverty = col_double(),
  ..   dist_pop = col_double(),
  ..   dist_cap = col_double(),
  ..   terror = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

All numeric variables are stored as num or dbl (double precision), and text variables are stored as chr (character).


STEP 2: Column Names and Types

View Original Column Names

Code
## Check current column names
names(remit_train)
 [1] "Country Name"       "year"               "remittances"       
 [4] "remittances_gdp"    "Country Code"       "gdp"               
 [7] "stock"              "unemployment"       "gdp_per"           
[10] "inflation"          "vulnerable_emp"     "maternal_mortality"
[13] "exchange_rate"      "deportations"       "internet"          
[16] "poverty"            "dist_pop"           "dist_cap"          
[19] "terror"            

Some column names have spaces and capital letters.

Clean Column Names to Snake Case

Code
## Convert to lowercase with underscores
remit_train <- remit_train |>
  clean_names()

## Verify the cleaned names
names(remit_train)
 [1] "country_name"       "year"               "remittances"       
 [4] "remittances_gdp"    "country_code"       "gdp"               
 [7] "stock"              "unemployment"       "gdp_per"           
[10] "inflation"          "vulnerable_emp"     "maternal_mortality"
[13] "exchange_rate"      "deportations"       "internet"          
[16] "poverty"            "dist_pop"           "dist_cap"          
[19] "terror"            

Now all column names are lowercase with underscores.


STEP 3: Missing Data Analysis

Count Missing Values by Column

Code
## Count NA values in each column
colSums(is.na(remit_train))
      country_name               year        remittances    remittances_gdp 
                 0                  0                  0                  0 
      country_code                gdp              stock       unemployment 
                 0                  0               1460                167 
           gdp_per          inflation     vulnerable_emp maternal_mortality 
                 0                 32                267                153 
     exchange_rate       deportations           internet            poverty 
                34                360                223               1949 
          dist_pop           dist_cap             terror 
               145                145                148 

Several variables have missing data. Let’s calculate the percentage!

Calculate Percentage of Missing Data

Code
## Calculate percent missing for each variable
remit_train |>
  summarise(across(everything(), ~sum(is.na(.)) / n() * 100))
# A tibble: 1 × 19
  country_name  year remittances remittances_gdp country_code   gdp stock
         <dbl> <dbl>       <dbl>           <dbl>        <dbl> <dbl> <dbl>
1            0     0           0               0            0     0  44.3
# ℹ 12 more variables: unemployment <dbl>, gdp_per <dbl>, inflation <dbl>,
#   vulnerable_emp <dbl>, maternal_mortality <dbl>, exchange_rate <dbl>,
#   deportations <dbl>, internet <dbl>, poverty <dbl>, dist_pop <dbl>,
#   dist_cap <dbl>, terror <dbl>

Major findings:

  • poverty: 64 % missing
  • stock:49.82.8% missing
  • remittances: % missing
  • remittances_gdp: 16.3% missing

STEP 4: Outlier Detection and Distribution Analysis

Examine Remittances (Our Target Variable)

Summary Statistics for Remittances

Code
## Get min, max, mean, median, quartiles
summary(remit_train$remittances)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
        6038     74764744    545314436   2655647852   2042426540 137674533896 

The mean ($2.55 billion) is much larger than the median ($529 million). The data is right-skewed.

Check overall data distribution

Code
## Create a point plot (from class notes Section 5.6.1)
remit_train |>
  select(remittances, remittances_gdp, gdp, stock, unemployment,deportations) |>
  pivot_longer(everything()) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~name, scales = "free") +
  theme_minimal()
Warning: Removed 1987 rows containing non-finite outside the scale range
(`stat_bin()`).

Code
#CAVEAT: I could not 'clean' the x axis (log would be the way to solve it but 
#we want to show how skewed the distribution is)

Point Plot to See Distribution

Code
## Create a point plot (from class notes Section 5.6.1)
remit_train |>
  ggplot(aes(remittances, 1)) +
  geom_point(alpha = 0.2) +
  scale_y_continuous(breaks = 0) +
  labs(y = NULL, title = "Distribution of Remittances") +
  theme_bw() +
  theme(panel.border = element_blank())

Most points cluster on the left (lower values) with a few extreme points on the right (confirms right-skewness).

Histogram to See Frequency Distribution

Code
## Create histogram with 30 bins
remit_train |>
  ggplot(aes(x = remittances)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  theme_minimal() +
  labs(title = "Distribution of Remittances",
       x = "Remittances (USD)",
       y = "Count")

The histogram is heavily concentrated on the left with a long tail to the right. This is classic right-skewed data.

Boxplot to Identify Outliers

Code
## Create boxplot to see outliers
remit_train |>
  ggplot(aes(y = remittances)) +
  geom_boxplot(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Boxplot of Remittances",
       y = "Remittances (USD)")

Many points appear above the upper whisker (outliers). These are likely large countries like Mexico that receive billions in remittances.

Examine GDP

Summary Statistics for GDP

Code
## Get summary statistics
summary(remit_train$gdp)
          Min.        1st Qu.         Median           Mean        3rd Qu. 
      37184925     6495301634    25703593810   303934455901   168683410868 
          Max. 
18316765021690 

GDP also shows huge range - from $21 million to $18.7 trillion.

Point Plot for GDP Distribution

Code
## Create point plot for GDP
remit_train |>
  ggplot(aes(gdp, 1)) +
  geom_point(alpha = 0.2) +
  scale_y_continuous(breaks = 0) +
  labs(y = NULL, title = "Distribution of GDP") +
  theme_bw() +
  theme(panel.border = element_blank())

GDP shows the same right-skewed pattern as remittances. Large economies have much higher GDP than small economies.

Examine Unemployment

Summary Statistics for Unemployment

Code
## Get summary statistics
summary(remit_train$unemployment)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.100   3.829   6.232   7.820  10.491  34.007     167 

Unemployment ranges from 0.11% to 34.01%.

Point Plot for Unemployment Distribution

Code
## Create point plot for unemployment
remit_train |>
  ggplot(aes(unemployment, 1)) +
  geom_point(alpha = 0.2) +
  scale_y_continuous(breaks = 0) +
  labs(y = NULL, title = "Distribution of Unemployment") +
  theme_bw() +
  theme(panel.border = element_blank())
Warning: Removed 167 rows containing missing values or values outside the scale range
(`geom_point()`).

Unemployment appears more evenly distributed than remittances or GDP.

Examine Inflation

Summary Statistics for Inflation

Code
## Get summary statistics
summary(remit_train$inflation)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
 -32.741    1.834    4.197   12.021    8.589 4800.532       32 

The maximum inflation is 6,041%! (outlier) Most inflation values are between 1.7% and 9%,

Point Plot for Inflation Distribution

Code
## Create point plot for inflation
remit_train |>
  ggplot(aes(inflation, 1)) +
  geom_point(alpha = 0.2) +
  scale_y_continuous(breaks = 0) +
  labs(y = NULL, title = "Distribution of Inflation") +
  theme_bw() +
  theme(panel.border = element_blank())
Warning: Removed 32 rows containing missing values or values outside the scale range
(`geom_point()`).

Data Quality Unit Tests

Test 1: Are All Remittances Positive?

Code
## Test that remittances > 0 when not missing
remit_train |>
  filter(!is.na(remittances)) |>
  verify(remittances > 0) |>
  summarise(mean_remittances = mean(remittances, na.rm = TRUE))
# A tibble: 1 × 1
  mean_remittances
             <dbl>
1      2655647852.

All remittances are positive. The mean is $2.55 billion.

Test 2: Are All Years in Valid Range?

Code
## Test that years are between 1994 and 2024
remit_train |>
  verify(year >= 1994 & year <= 2024) |>
  summarise(mean_year = mean(year))
# A tibble: 1 × 1
  mean_year
      <dbl>
1     2010.

All years are within the expected range.

Test 3: Are All Unemployment Values Valid?

Code
## Test that unemployment is between 0 and 100
remit_train |>
  filter(!is.na(unemployment)) |>
  verify(unemployment >= 0 & unemployment <= 100) |>
  summarise(mean_unemployment = mean(unemployment, na.rm = TRUE))
# A tibble: 1 × 1
  mean_unemployment
              <dbl>
1              7.82

All unemployment values are valid percentages (0-100%).


STEP 5: Log Transformations for Skewed Variables

Since remittances and GDP are highly right-skewed, we need to create and examine log.

Create Log-Transformed Variables

Code
## Create log-transformed versions of skewed variables
remit_train <- remit_train |>
  mutate(
    log_remittances = log(remittances + 1),
    log_gdp = log(gdp + 1)
  )

We add 1 before taking the log to handle any zero values (log(0) is undefined).

Examine Log-Transformed Remittances

Summary Statistics for Log Remittances

Code
## Get summary statistics for log remittances
summary(remit_train$log_remittances)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.706  18.130  20.117  19.744  21.437  25.648 

Histogram of Log Remittances

Code
## Create histogram for log-transformed remittances
remit_train |>
  filter(!is.na(log_remittances)) |>
  ggplot(aes(x = log_remittances)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Log-Transformed Remittances",
       subtitle = "Much more normal distribution after log transformation",
       x = "Log(Remittances + 1)",
       y = "Count")

The log-transformed remittances show a much more normal distribution compared to the original right-skewed data.

Boxplot of Log Remittances

Code
## Create boxplot for log remittances
remit_train |>
  filter(!is.na(log_remittances)) |>
  ggplot(aes(y = log_remittances)) +
  geom_boxplot(fill = "darkgreen") +
  theme_minimal() +
  labs(title = "Boxplot of Log-Transformed Remittances",
       y = "Log(Remittances + 1)")

Fewer outliers visible after log transformation.

Examine Log-Transformed GDP

Summary Statistics for Log GDP

Code
## Get summary statistics for log GDP
summary(remit_train$log_gdp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.43   22.59   23.97   24.13   25.85   30.54 

Histogram of Log GDP

Code
## Create histogram for log-transformed GDP
remit_train |>
  filter(!is.na(log_gdp)) |>
  ggplot(aes(x = log_gdp)) +
  geom_histogram(bins = 30, fill = "darkblue", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Log-Transformed GDP",
       subtitle = "More normal distribution after log transformation",
       x = "Log(GDP + 1)",
       y = "Count")

Log GDP also shows a more normal distribution.

Compare Original vs Log-Transformed

Side-by-Side: Original vs Log Remittances

Code
## Create comparison plots
p1 <- remit_train |>
  filter(!is.na(remittances)) |>
  ggplot(aes(x = remittances)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  theme_minimal() +
  labs(title = "Original Remittances (Right-Skewed)",
       x = "Remittances (USD)")

p2 <- remit_train |>
  filter(!is.na(log_remittances)) |>
  ggplot(aes(x = log_remittances)) +
  geom_histogram(bins = 30, fill = "darkgreen") +
  theme_minimal() +
  labs(title = "Log-Transformed Remittances (More Normal)",
       x = "Log(Remittances + 1)")

grid.arrange(p1, p2, ncol = 2)

The log transformation successfully converts the right-skewed distribution into a more normal distribution, which is better for modeling.

Side-by-Side: Original vs Log GDP

Code
## Create comparison plots for GDP
p3 <- remit_train |>
  filter(!is.na(gdp)) |>
  ggplot(aes(x = gdp)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  theme_minimal() +
  labs(title = "Original GDP (Right-Skewed)",
       x = "GDP (USD)")

p4 <- remit_train |>
  filter(!is.na(log_gdp)) |>
  ggplot(aes(x = log_gdp)) +
  geom_histogram(bins = 30, fill = "darkblue") +
  theme_minimal() +
  labs(title = "Log-Transformed GDP (More Normal)",
       x = "Log(GDP + 1)")

grid.arrange(p3, p4, ncol = 2)

Relationship: Log GDP vs Log Remittances

Code
## Scatter plot with log-transformed variables
remit_train |>
  filter(!is.na(log_gdp), !is.na(log_remittances)) |>
  ggplot(aes(x = log_gdp, y = log_remittances)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(title = "Log GDP vs Log Remittances",
       subtitle = "Clearer linear relationship after log transformation",
       x = "Log(GDP + 1)",
       y = "Log(Remittances + 1)")
`geom_smooth()` using formula = 'y ~ x'

The relationship between log GDP and log remittances is more linear than the original variables, which will improve model performance.

Assessing deportations and remittances

Code
ggplot(remit_train, aes(x = deportations + 1, y = remittances + 1)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Log–Log Relationship Between Deportations and Remittances",
    x = "Log(Deportations + 1)",
    y = "Log(Remittances + 1)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 360 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 360 rows containing missing values or values outside the scale range
(`geom_point()`).

  • On a log–log scale, deportations and remittances show a positive but noisy association, suggesting remittances tend to rise with deportations.

STEP 6: Categorical Variables Analysis

Count Unique Countries

Code
## Count distinct country names
n_distinct(remit_train$country_name)
[1] 151

We have 158 different countries in the dataset.

View Frequency Table (First 20 Countries)

Code
## Show how many observations per country
head(table(remit_train$country_name), 20)

        Afghanistan             Albania             Algeria              Angola 
                 16                  25                  27                  12 
Antigua and Barbuda           Argentina             Armenia           Australia 
                 25                  22                  23                  21 
            Austria          Azerbaijan          Bangladesh            Barbados 
                 22                  21                  22                  25 
            Belarus             Belgium              Belize               Benin 
                 26                  22                  20                  26 
            Bermuda              Bhutan             Bolivia            Botswana 
                 16                  12                  29                  26 

Most countries have between 20-30 observations, representing roughly 20-30 years of data.

Create Bar Chart of Top 20 Countries

Code
## Count observations per country and plot top 20
remit_train |>
  count(country_name, sort = TRUE) |>
  slice_head(n = 20) |>
  ggplot(aes(x = reorder(country_name, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 20 Countries by Number of Observations",
       x = "Country",
       y = "Count")

Countries are fairly evenly represented. Bahrain has the most observations (30), while several countries have around 20-29 observations.

Create Bar Chart of Top 15 by total remittances (2024) & Top 15 by remittances/GDP (2024)

Code
remit_train %>%
  filter(year == 2024) %>%
  slice_max(remittances, n = 15) %>%
  ggplot(aes(
    x = fct_reorder(country_name, remittances),
    y = remittances / 1e9
  )) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 15 Remittance Receivers (2024)",
    x = "Country",
    y = "Remittances (Billions USD)"
  ) +
  theme_minimal()

Code
# Top 15 by remittances/GDP (2024)
remit_train %>%
  filter(year == 2024) %>%
  slice_max(remittances_gdp, n = 15) %>%   # preferred over top_n()
  mutate(`Country Name` = fct_reorder(country_name, remittances_gdp)) %>%
  ggplot(aes(x = `Country Name`, y = remittances_gdp)) +
  geom_col(fill = "coral") +
  coord_flip() +
  labs(
    title = "Top 15 Countries: Remittances as % GDP (2024)",
    x = "Country",
    y = "Remittances (% GDP)"
  ) +
  theme_minimal()

Top 10 countries - Remittances as a % of the GDP

Code
remit_train |>
  group_by(country_name) |>
  summarize(mean_ratio = mean(remittances_gdp, na.rm = TRUE)) |>
  arrange(desc(mean_ratio)) |>
  slice_head(n = 10)
# A tibble: 10 × 2
   country_name mean_ratio
   <chr>             <dbl>
 1 Lesotho            42.9
 2 Tonga              31.8
 3 Bermuda            21.1
 4 Nepal              20.2
 5 Samoa              19.6
 6 Lebanon            19.5
 7 El Salvador        17.9
 8 Kosovo             17.0
 9 Jordan             16.1
10 Honduras           15.1
Code
top10 <- remit_train |>
  group_by(country_name) |>
  summarize(mean_ratio = mean(remittances_gdp, na.rm = TRUE)) |>
  arrange(desc(mean_ratio)) |>
  slice_head(n = 10)

ggplot(top10, aes(x = reorder(country_name, mean_ratio), y = mean_ratio)) +
  geom_col(fill = "coral") +
  coord_flip() +
  labs(
    title = "Top 10 Countries by Average Remittances % of GDP",
    x = "Country",
    y = "Average Remittances/GDP"
  ) +
  theme_minimal()

Remittances per country (Original Scale)

Code
remit_train |>
  filter(country_name %in% c(
    "Nicaragua", "El Salvador", "Honduras", "Guatemala", "Haiti", "India"
  )) |>
  ggplot(aes(x = year, y = remittances_gdp, color = country_name)) +
  geom_line(linewidth = 1) +
  scale_y_log10() +
  labs(
    title = "Remittance Trends Over Time",
    subtitle = "Selected Countries (log scale)",
    x = "Year",
    y = "Remittances (% of GDP, log scale)",
    color = "Country"
  ) +
  theme_minimal()

Start assessing countries of interest

Code
#Start assessing countries of interes
countries_of_interest <- c( "Nicaragua", "El Salvador", "Honduras", "Guatemala", 
                            "Haiti", "India")

filtered <- remit_train |>
  filter(country_name %in% countries_of_interest)

ggplot(filtered, aes(log(stock), log(remittances), color = country_name)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(stock)",
    y = "log(remittances)",
    title = "Stock–Remittance Relationship for Selected Countries",
    color = "Country"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 13 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).

Code
#Comment: comparing the stock–Remittance Relationship for Selected Countries

STEP 6: Relationships Between Variables

GDP vs Remittances (Original Scale)

Code
## Scatter plot with trend line
remit_train |>
  ggplot(aes(x = gdp, y = remittances)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(title = "Relationship Between GDP and Remittances (Original Scale)",
       x = "GDP (USD)",
       y = "Remittances (USD)")
`geom_smooth()` using formula = 'y ~ x'

There is a clear positive relationship. Countries with larger economies (higher GDP) tend to receive more remittances in absolute dollar amounts. The red line shows the linear trend.

Log GDP vs Log Remittances (Better for Modeling)

Code
## Scatter plot with log-transformed variables
remit_train |>
  filter(!is.na(log_gdp), !is.na(log_remittances)) |>
  ggplot(aes(x = log_gdp, y = log_remittances)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(title = "Log GDP vs Log Remittances (Log Scale - Better Linear Fit)",
       subtitle = "This relationship is more appropriate for linear regression models",
       x = "Log(GDP + 1)",
       y = "Log(Remittances + 1)")
`geom_smooth()` using formula = 'y ~ x'

The log-transformed relationship is more linear and will produce better model predictions.

GDP Per Capita vs Remittances as % of GDP

Code
## Scatter plot with trend line
remit_train |>
  ggplot(aes(x = gdp_per, y = remittances_gdp)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(title = "GDP Per Capita vs Remittances as % of GDP",
       x = "GDP Per Capita (USD)",
       y = "Remittances as % of GDP")
`geom_smooth()` using formula = 'y ~ x'

Poorer countries (lower GDP per capita) depend more heavily on remittances as a percentage of their economy. Richer countries receive remittances but they represent a smaller share of their total GDP.

Unemployment vs Remittances

Code
## Scatter plot with trend line
remit_train |>
  ggplot(aes(x = unemployment, y = remittances)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(title = "Unemployment vs Remittances",
       x = "Unemployment Rate (%)",
       y = "Remittances (USD)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 167 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 167 rows containing missing values or values outside the scale range
(`geom_point()`).

There is a slight negative relationship (it is weak). Unemployment doesn’t appear to be a strong predictor of remittances.

Correlation Matrix (Numbers)

Code
## Calculate correlations between all numeric variables
remit_train |>
  select(where(is.numeric)) |>
  cor(use = "complete.obs") |>
  round(2)
                    year remittances remittances_gdp   gdp stock unemployment
year                1.00        0.18            0.07  0.13 -0.01        -0.07
remittances         0.18        1.00            0.02  0.47  0.47        -0.12
remittances_gdp     0.07        0.02            1.00 -0.21  0.03         0.03
gdp                 0.13        0.47           -0.21  1.00  0.18        -0.09
stock              -0.01        0.47            0.03  0.18  1.00        -0.13
unemployment       -0.07       -0.12            0.03 -0.09 -0.13         1.00
gdp_per             0.22        0.02           -0.38  0.19 -0.09        -0.07
inflation          -0.13       -0.06            0.03 -0.12  0.00        -0.06
vulnerable_emp     -0.08        0.05            0.39 -0.10  0.06        -0.12
maternal_mortality -0.10        0.04            0.11 -0.11 -0.01        -0.19
exchange_rate       0.03        0.05           -0.05 -0.03 -0.05        -0.09
deportations       -0.09        0.28            0.23  0.01  0.76        -0.13
internet            0.68        0.06           -0.27  0.20 -0.08         0.02
poverty            -0.28        0.01            0.18 -0.11  0.04        -0.12
dist_pop            0.06        0.08           -0.09  0.13 -0.17        -0.06
dist_cap            0.06        0.09           -0.11  0.13 -0.19        -0.04
terror             -0.09        0.21            0.24  0.04  0.21        -0.12
log_remittances     0.26        0.69            0.13  0.36  0.30        -0.13
log_gdp             0.14        0.45           -0.59  0.60  0.19        -0.06
                   gdp_per inflation vulnerable_emp maternal_mortality
year                  0.22     -0.13          -0.08              -0.10
remittances           0.02     -0.06           0.05               0.04
remittances_gdp      -0.38      0.03           0.39               0.11
gdp                   0.19     -0.12          -0.10              -0.11
stock                -0.09      0.00           0.06              -0.01
unemployment         -0.07     -0.06          -0.12              -0.19
gdp_per               1.00     -0.30          -0.63              -0.35
inflation            -0.30      1.00           0.15               0.17
vulnerable_emp       -0.63      0.15           1.00               0.65
maternal_mortality   -0.35      0.17           0.65               1.00
exchange_rate        -0.18      0.07           0.27               0.24
deportations         -0.14      0.02           0.09               0.00
internet              0.72     -0.28          -0.60              -0.43
poverty              -0.44      0.25           0.71               0.74
dist_pop             -0.19      0.11           0.28               0.22
dist_cap             -0.15      0.10           0.25               0.21
terror               -0.58      0.27           0.57               0.39
log_remittances       0.11     -0.14          -0.01              -0.06
log_gdp               0.51     -0.16          -0.40              -0.23
                   exchange_rate deportations internet poverty dist_pop
year                        0.03        -0.09     0.68   -0.28     0.06
remittances                 0.05         0.28     0.06    0.01     0.08
remittances_gdp            -0.05         0.23    -0.27    0.18    -0.09
gdp                        -0.03         0.01     0.20   -0.11     0.13
stock                      -0.05         0.76    -0.08    0.04    -0.17
unemployment               -0.09        -0.13     0.02   -0.12    -0.06
gdp_per                    -0.18        -0.14     0.72   -0.44    -0.19
inflation                   0.07         0.02    -0.28    0.25     0.11
vulnerable_emp              0.27         0.09    -0.60    0.71     0.28
maternal_mortality          0.24         0.00    -0.43    0.74     0.22
exchange_rate               1.00        -0.04    -0.16    0.29     0.39
deportations               -0.04         1.00    -0.19    0.13    -0.23
internet                   -0.16        -0.19     1.00   -0.60    -0.14
poverty                     0.29         0.13    -0.60    1.00     0.23
dist_pop                    0.39        -0.23    -0.14    0.23     1.00
dist_cap                    0.37        -0.26    -0.11    0.21     1.00
terror                      0.20         0.21    -0.52    0.47     0.27
log_remittances             0.08         0.21     0.19   -0.06     0.03
log_gdp                     0.04         0.00     0.43   -0.27     0.10
                   dist_cap terror log_remittances log_gdp
year                   0.06  -0.09            0.26    0.14
remittances            0.09   0.21            0.69    0.45
remittances_gdp       -0.11   0.24            0.13   -0.59
gdp                    0.13   0.04            0.36    0.60
stock                 -0.19   0.21            0.30    0.19
unemployment          -0.04  -0.12           -0.13   -0.06
gdp_per               -0.15  -0.58            0.11    0.51
inflation              0.10   0.27           -0.14   -0.16
vulnerable_emp         0.25   0.57           -0.01   -0.40
maternal_mortality     0.21   0.39           -0.06   -0.23
exchange_rate          0.37   0.20            0.08    0.04
deportations          -0.26   0.21            0.21    0.00
internet              -0.11  -0.52            0.19    0.43
poverty                0.21   0.47           -0.06   -0.27
dist_pop               1.00   0.27            0.03    0.10
dist_cap               1.00   0.24            0.04    0.12
terror                 0.24   1.00            0.20   -0.08
log_remittances        0.04   0.20            1.00    0.53
log_gdp                0.12  -0.08            0.53    1.00
  • gdp and remittances: 0.46 - Moderate positive (as GDP increases, remittances increase)
  • log_gdp and log_remittances: Check the correlation matrix - Likely stronger correlation
  • stock and deportations: 0.73 - Strong positive (might cause multicollinearity issues)
  • dist_pop and dist_cap: 1.00 - Perfect correlation! These measure essentially the same thing. We must drop one.
  • internet and year: 0.69 - Strong positive (internet access increases over time)
  • gdp_per and vulnerable_emp: -0.63 - Strong negative (richer countries have less vulnerable employment)

Correlation Matrix (Heatmap)

Code
## Create visual correlation matrix (Eva's example)
remit_train |>
  select(where(is.numeric)) |>
  cor(use = "complete.obs") |>
  corrplot(method = "color", 
           type = "upper",
           tl.col = "black",
           tl.srt = 45,
           title = "Correlation Matrix",
           mar = c(0,0,2,0))

  • Blue squares = positive correlation (variables increase together)
  • Red squares = negative correlation (one increases, other decreases)
  • Darker colors = stronger correlation
Code
## Create visual correlation matrix (Gaby's example)
library(reshape2)

Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':

    smiths
Code
#Create variable to represent numeric vars 
numeric_vars_log <- remit_train %>%
  select(remittances_gdp, remittances, stock, unemployment, gdp_per, 
         inflation, internet, dist_cap, terror) %>%
  mutate(
    remittances_gdp = log1p(remittances_gdp),
    remittances     = log1p(remittances),
    stock            = log1p(stock),        # migrant stock
    gdp_per          = log1p(gdp_per),
    internet         = log1p(internet),
    dist_cap         = log1p(dist_cap)
  ) %>%
  na.omit()

cor_matrix_log <- cor(numeric_vars_log, use = "complete.obs")
melted_cor_log <- melt(cor_matrix_log)

ggplot(melted_cor_log, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), size = 3) +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "red",
    midpoint = 0, limits = c(-1, 1)
  ) +
  labs(title = "Correlation Heatmap (Log-Transformed Variables)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • Remittances are most strongly positively correlated with the stock of migrants (0.49), suggesting migrant presence is a key driver, while other macro variables show only weak associations.
  • Remittances as % of GDP are negatively correlated with GDP per capita ( −0.48), indicating remittances matter more (relative to GDP) in poorer countries.

STEP 8: Exploring lagged effects

It is likely that past changes to country of origin conditions is likely to be more illustrative of future remittances than current conditions. For example, while a downward shock in GDP may influence current migration, migrants may take time to settle into the US and thus begin remitting back home.

These lagged effect would seem to be the most plausible with GDP, unemployment, terror, deportations, and changes in migrant stock (inward migration)

Code
## Lagged (1 year)
remit_lag <- remit_train |>
  mutate(year = as.numeric(year)) |>
  arrange(country_name, year) |>
  group_by(country_name) |>
  mutate(
    gdp_lag = lag(gdp_per),
    unemp_lag = lag(unemployment), 
    terror_lag = lag(terror), 
    deportations_lag = lag(deportations), 
    stock_lag = lag(stock), 
  ) |>
  ungroup()
# Verifying that the lag worked. 
remit_lag |>
  select(country_name, year, gdp_per, gdp_lag) |>
  arrange(country_name, year) |>
  filter(!is.na(gdp_lag)) |>
  slice_head(n = 10)
# A tibble: 10 × 4
   country_name  year gdp_per gdp_lag
   <chr>        <dbl>   <dbl>   <dbl>
 1 Afghanistan   2009    452.    382.
 2 Afghanistan   2010    561.    452.
 3 Afghanistan   2011    607.    561.
 4 Afghanistan   2012    651.    607.
 5 Afghanistan   2013    637.    651.
 6 Afghanistan   2014    625.    637.
 7 Afghanistan   2015    566.    625.
 8 Afghanistan   2016    522.    566.
 9 Afghanistan   2017    525.    522.
10 Afghanistan   2018    491.    525.

Lagged GDP per Capita

Code
## Lagged predictors relationship with remittances (as % of GDP)

## GDP per capita
# Lagged vs Unlagged
remit_lag |>
  pivot_longer(cols = c(gdp_per, gdp_lag),
               names_to = "type",
               values_to = "value") |>
  ggplot(aes(value, remittances_gdp, color = type)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  theme_minimal() +
  labs(title = "Lagged vs Current GDP per Capita",
       x = "GDP per capita",
       color = "Variable") ## Doesn't Necessarily Improve Model Fit. 

Overall there seem to be better ways to verify whether lagged variable would improve the interpretability of our models.

Code
## Comparing Lagged vs Current GDP per capita for key countries. 
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(gdp_per, gdp_lag),
    names_to = "gdp_type",
    values_to = "gdp_value"
  ) |>
  ggplot(aes(x = gdp_value, y = remittances_gdp,
             color = country_name, linetype = gdp_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current GDP per Capita",
    x = "GDP per capita (current or lagged)",
    linetype = "GDP variable"
  )

Suggestion is that Lagged GDP demonstrates a slightly stronger relationship and thus may improve model fit. Thus it may seem that shocks or changes to prior GDP could help explain current remittances amounts.

Lagged Unemployment

Code
## Comparing Lagged vs Current Unemployment for key countries. 
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(unemployment, unemp_lag),
    names_to = "unemp_type",
    values_to = "unemp_value"
  ) |>
  ggplot(aes(x = unemp_value, y = remittances_gdp,
             color = country_name, linetype = unemp_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current Unemployment",
    x = "Unemployment (current or lagged)",
    linetype = "GDP variable"
  )

For most countries lagged unemployment does not seem to alter model fit substantially for any country other than Haiti.

It likely won’t improve our model fit and thus shouldn’t be included.

Lagged Changes in Terror Level

Code
## Comparing Lagged vs Current Terror for key countries. 
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(terror, terror_lag),
    names_to = "terror_type",
    values_to = "terror_value"
  ) |>
  ggplot(aes(x = terror_value, y = remittances_gdp,
             color = country_name, linetype = terror_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current Terror",
    x = "Terror (current or lagged)",
    linetype = "Terror variable"
  )

It seems terror varies less, and lagged terror levels may not be too explanatory

Lagged Changes in Deportations

Code
## Comparing Lagged vs Current Deportations for key countries. 
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(deportations, deportations_lag),
    names_to = "deportations_type",
    values_to = "deportations_value"
  ) |>
  ggplot(aes(x = deportations_value, y = remittances_gdp,
             color = country_name, linetype = deportations_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current Deportations",
    x = "Deportations (current or lagged)",
    linetype = "Deportations variable"
  )

Much stronger relationship for key countries in including the lagged effects of deportations in explaining future remittances.

Code
## Comparing Lagged vs Current Deportations for key countries. 
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(stock, stock_lag),
    names_to = "stock_type",
    values_to = "stock_value"
  ) |>
  ggplot(aes(x = stock_value, y = remittances_gdp,
             color = country_name, linetype = stock_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current Changes in Migrant Stock ",
    x = "Migrant Stock (current or lagged)",
    linetype = "Migrant Stock variable"
  )

Less Strong change and thus probably doesn’t warrant inclusion.

Takeaways:

  • Predictive power of lagged deportations and GDP improve model fit the best (shift the slopes of our relationships most).

  • For the other predictors, including changes to migrant stock, terror, and unemployment the relationships for our key variables barely changed indicating no changes in explanatory power.

  • Next Steps: using step_mutate in our recipe to add lags to our gdp_per and deportations would account for this.